3. Linear Models

  • motivating case study

  • linear models

  • sparse linear models

  • bias variance trade-off

Reading

  • Sections 10.4–10.6. Yu, B., & Barter, R. L. (2024). Veridical data science. London, England: MIT Press. https://vdsbook.com/10-ls_continued

Learning outcomes

  1. Describe the theoretical foundation of intrinsically interpretable models like sparse regression, Gaussian processes, and classification and regression trees, and apply them to realistic case studies with appropriate validation checks.

  2. Compare the competing definitions of interpretable machine learning, the motivations behind them, and metrics that can be used to quantify whether they have been met.

Drug response prediction

  • Patients with the same diagnosed cancer often respond very differently to the same drug. How can we figure out which drugs any particular patient will respond to?

  • If we imagine drug effectiveness = f(gene activity), then one approach is to measure activity of genetic pathways in biopsies of that patient’s cancer tissue.

  • The most important features in that model can be used to stratify patients into different responder/non-responder subtypes.

Study design

  • To learn this relationship, previous studies used “immortalized” cancer cell lines. These don’t fully represent the complexity of cancer in real populations.

  • The study (Dietrich et al. 2017) measured drug responses in primary samples from patients being treated for blood cancer (chronic lymphocytic leukemia, CLL). They simultaneously measured gene expression and DNA methylation activity (we will skip over some of the biology, but happy to discuss it with anyone who’s interested).

Drug response data

Features

  • 121 samples from CLL patients

  • 61 drugs, 5 dosages per drug

  • 9553 features total

Outcome

  • Cell viability: percentage of surviving cells after drug exposure

Main Question



For a given drug treatment profile, which RNA or methylation features differentiate between drug sensitivity vs. resistance?

Review Code Example

  • 01-linear_model.qmd: “Exploration”

Sparse Linear Regression

  • Viability can be viewed as a response variable \(\mathbf{y} \in \mathbf{R}^{N}\), and the molecular variables can be treated as features \(\mathbf{X} \in \mathbf{R}^{N \times J}\).

  • The setting is high-dimensional with fewer samples (\(N = 121\)) than features (\(J = 9553\)). Without regularization, the problem is underdetermined.

  • Sparsity will help us focus on the most important pathways out of thousands of candidates.

This is a case where lasso regression was important in a real scientific study.
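To build intuition for why sparsity helps here, the sketch below simulates a setting like the case study's (these are made-up data, not the Dietrich et al. measurements): fewer samples than features, with only a handful of features truly driving the response. The lasso recovers a small subset of candidates.

```python
# Illustrative sketch on simulated data: N < J, with only 5 truly
# active features out of 1000. The lasso selects a small subset.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, J = 121, 1000                              # fewer samples than features
X = rng.normal(size=(N, J))
beta_true = np.zeros(J)
beta_true[:5] = [2.0, -1.5, 1.0, 0.8, -0.6]   # only 5 features truly matter
y = X @ beta_true + rng.normal(scale=0.5, size=N)

fit = Lasso(alpha=0.1, max_iter=5000).fit(X, y)
selected = np.flatnonzero(fit.coef_)          # features with nonzero coefficients
print(len(selected))                          # a small fraction of the 1000 candidates
```

Unregularized least squares would instead return a dense, underdetermined solution with all 1000 coefficients nonzero.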

Linear Regression Review

Single continuous predictor

\[\begin{align*} y_i=\beta_0+x_{i 1} \beta_1+\epsilon_i \end{align*}\]

The least-squares estimate \(\hat{\beta} := \left(\hat{\beta}_{0}, \hat{\beta}_{1}\right)\)​ is found by minimizing

\[\begin{align*} \min_{\beta_0, \beta_1} \sum_{i = 1}^{N}\left(y_i-\beta_0-x_{i 1} \beta_1\right)^2 \end{align*}\]
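As a quick sanity check on this objective (using simulated data, not an example from the reading), the minimizer has the textbook closed form: the slope is \(\operatorname{cov}(x, y)/\operatorname{var}(x)\) and the intercept is \(\bar{y} - \hat{\beta}_1 \bar{x}\).

```python
# Verify that the closed-form least-squares solution matches a
# numerical minimizer of the same squared-error objective.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=50)

beta1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # cov(x, y) / var(x)
beta0 = y.mean() - beta1 * x.mean()

# np.polyfit minimizes the same least-squares loss
slope, intercept = np.polyfit(x, y, deg=1)
print(np.allclose([beta0, beta1], [intercept, slope]))  # True
```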

Examples

  1. In the reading, model house price \(y_{i}\) as a function of house area \(x_{i}\).

  2. In the case study, model viability \(y_{i}\) for sample \(i\) as a linear function of a single gene’s expression level \(x_{i} \in \mathbf{R}\).

Sketch

Each choice of \(\beta_{0}, \beta_{1}\) is associated with a different straight line and a different loss value.

Empirical Loss Surface

For a given dataset, the loss across all choices of \(\beta_{0}, \beta_{1}\) is a quadratic function. The minimizer is the least squares solution.

Single categorical predictor

If a variable includes \(K\) categories, it can be encoded into \(K - 1\) binary indicator (“dummy”) columns,

\[\begin{align*} x_{ik} = \mathbf{1}\{\text{sample } i \text{ belongs to level } k\} \end{align*}\]

The left-out category is the reference level.
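A minimal pandas sketch of this encoding, with made-up neighborhood labels (note that `get_dummies` drops the alphabetically first level as the reference, which need not match the reading's choice of “Somerset”):

```python
# K = 3 categories become K - 1 = 2 binary columns; the dropped
# category ("Edwards", alphabetically first) is the reference level,
# encoded as a row of all zeros.
import pandas as pd

neighborhood = pd.Series(["Somerset", "Gilbert", "Edwards", "Somerset"])
X = pd.get_dummies(neighborhood, drop_first=True, dtype=int)
print(X)
#    Gilbert  Somerset
# 0        0         1
# 1        1         0
# 2        0         0
# 3        0         1
```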

Examples

In the reading, \(x_{i} \in \{\text{Gilbert}, \text{North Ames}, \text{Edwards}, ...\}\) records the neighborhood for house \(i\).

  • \(\beta_0\)​: The typical price in the reference “Somerset” neighborhood.

  • \(\beta_1\): The amount the predicted price changes when moving from “Somerset” to “Gilbert”

  • \(\beta_{2}\): The amount the predicted price changes when moving from “Somerset” to “North Ames”

and similarly for the remaining neighborhoods.

Examples

In the case study, \(x_{i} \in \{\text{no mutation}, \text{mutated}\}\) could record whether the patient has a mutation in a particular gene.

  • \(\beta_0\)​: The expected viability for the reference group (\(x_{i} = 0\)).

  • \(\beta_1\): The difference in expected viability between the mutated \(x_{i} = 1\) and reference groups

Multiple Linear Regression

Assumed model form:

\[\begin{align*} y_{i} &= \sum_{j = 1}^{J}x_{ij}\beta_{j} + \epsilon_{i} \\ &:= \mathbf{x}_{i}^\top \beta + \epsilon_{i} \end{align*}\]

  • \(\epsilon_{i}\) represents random variation due to unmeasured factors.

  • An intercept can be included by fixing \(x_{i1} = 1\) for all samples.

Fitting Multiple Linear Regression

  • We can estimate \(\hat{\beta} \in \mathbf{R}^{J}\) by minimizing the sum of squares loss,

\[\begin{align*} \min_{\beta} \sum_{i=1}^N \left( y_i - \mathbf{x}_i^\top \beta \right)^2 \end{align*}\]
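The minimizer of this loss solves the normal equations \(\left(\mathbf{X}^\top \mathbf{X}\right)\hat{\beta} = \mathbf{X}^\top \mathbf{y}\). A sketch on simulated data (not the case study's):

```python
# Fit multiple linear regression by minimizing the sum of squares;
# np.linalg.lstsq solves the normal equations numerically.
import numpy as np

rng = np.random.default_rng(2)
N, J = 100, 3
X = rng.normal(size=(N, J))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # estimates close to beta_true
```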

Visualization: Two continuous predictors

We can imagine how viability changes when changing the expression levels for two genes simultaneously.

Coefficient Interpretation

Ceteris Paribus

  • “All other things being equal”.

  • \(\beta_j\) gives the impact of changing \(x_{j}\) while every other feature \(x_{k}\), \(k \neq j\), in the model is held fixed.

Example

In the housing price example,

\[\begin{align*} \text{predicted price} = &-871,630 + 88 \times \text{area} + 19,129 \times \text{quality} + \\ &426 \times \text{year} - 12,667 \times \text{bedroom} \end{align*}\]

For every additional square foot, the price increases by $88.

Caution: Extrapolation

In the housing price example,

\[\begin{align*} \text{predicted price} = &-871,630 + 88 \times \text{area} + 19,129 \times \text{quality} + \\ & 426 \times \text{year} - 12,667 \times \text{bedroom} \end{align*}\]

When all the covariates are 0, the predicted price is negative, which makes no sense. But there are also no 0-square-foot homes in the data, so the model has no information about that region; the prediction is an extrapolation.

Caution: Context

The coefficient values must be interpreted within the context of all other predictors. With area included, the fit is

\[\begin{align*} \text{predicted price} = &-871,630 + 88 \times \text{area} + 19,129 \times \text{quality}+ \\ &426 \times \text{year} - 12,667 \times \text{bedroom} \end{align*}\]

After dropping area, the remaining coefficients change substantially; the bedroom coefficient even flips sign:

\[\begin{align*} \text{predicted price} = &-750,097 + 37,765 \times \text{quality} + 335 \times \text{year} + \\ & 13,935 \times \text{bedroom} \end{align*}\]

This instability in coefficient interpretations is most severe when predictors are correlated with one another.
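This instability is easy to reproduce in a small simulation (hypothetical data): with two highly correlated features, individual coefficients are poorly determined, and dropping one of the features substantially changes the coefficient of the other.

```python
# Two nearly collinear features: the individual coefficients are
# unstable, but their sum (the stable direction) is well determined.
import numpy as np

rng = np.random.default_rng(3)
N = 200
x1 = rng.normal(size=N)
x2 = x1 + rng.normal(scale=0.1, size=N)       # x2 nearly equals x1
y = x1 + x2 + rng.normal(scale=0.5, size=N)   # both contribute equally

X_full = np.column_stack([x1, x2])
b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
b_drop, *_ = np.linalg.lstsq(x1[:, None], y, rcond=None)

print(b_full)   # the shared signal is split unstably between x1 and x2
print(b_drop)   # roughly 2: x1 alone absorbs x2's contribution
```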

Caution: Context

In the case study, genes are correlated when they lie on the same pathway. Holding other genes “fixed” is not realistic. Estimates will change if any genes are dropped.

Correlated genes example 1

Correlated genes example 2

Caution: Standardization

Large coefficient \(\neq\) an important predictor.

  • The scale of the original features influences the size of the coefficients.
  • One solution is to standardize the input features.
  • Alternatively, use the bootstrap to estimate \(\text{SD}\left(\beta_{j}\right)\). Large + stable coefficients are more relevant.
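The bootstrap idea in the last bullet can be sketched as follows (simulated data; in practice you would resample the actual dataset): refit the model on resampled rows and use the spread of each coefficient to judge its stability.

```python
# Bootstrap the least-squares coefficients: resample rows with
# replacement, refit, and summarize each coefficient's spread.
import numpy as np

rng = np.random.default_rng(4)
N = 100
X = rng.normal(size=(N, 2))
y = X @ np.array([1.0, 0.0]) + rng.normal(scale=0.5, size=N)

boot = []
for _ in range(500):
    idx = rng.integers(0, N, size=N)          # resample rows with replacement
    b, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    boot.append(b)
boot = np.array(boot)

sd = boot.std(axis=0)
print(boot.mean(axis=0), sd)   # coefficient 0 is large relative to its SD
```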

Discussion: Linear Model Interpretability

Respond to [Linear Model Interpretability] in the exercise sheet.

Regularization

Definition

Regularizing a predictive model means forcing it towards a simpler solution. This is usually achieved by adding penalty terms to the optimization objective used to estimate the model parameters.

Why Regularize? Improving Stability

When features are correlated, the loss surface has long “valleys” where any of the solutions look equally good.

This can lead to instability in the resulting fits.

\(\ell^{2}\) Regularization

One way to address this is to add an \(\ell^{2}\) penalty to the least-squares objective:

\[\begin{align*} \min_{\beta \in \mathbf{R}^J} \left[ \frac{1}{2N} \sum_{i=1}^N \left(y_i - \mathbf{x}_i^\top \beta\right)^2 + \lambda \lVert \beta \rVert_{2}^{2} \right]. \end{align*}\] This is the same loss as linear regression, but with a new penalty term, where \(\lVert \beta \rVert_{2}^{2} = \sum_{j} \beta_{j}^{2}\) is the squared \(\ell^{2}\) norm and \(\lambda \geq 0\) is a tuning parameter controlling model complexity.
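A small sketch of the shrinkage effect on simulated data. Note that scikit-learn's `Ridge` uses `alpha` for \(\lambda\) and omits the \(1/2N\) scaling, so the values are not directly comparable to the objective above, but the qualitative behavior is the same: larger penalties pull \(\hat{\beta}\) toward the origin.

```python
# The l2 norm of the ridge solution shrinks as the penalty grows.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, -0.5]) + rng.normal(size=100)

norms = []
for alpha in [0.1, 10.0, 1000.0]:
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    norms.append(np.linalg.norm(coef))
print(norms)   # strictly decreasing: coefficients shrink toward 0
```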

\(\ell^{2}\) Regularization

This method is called ridge regression. Geometrically, this penalty encourages \(\beta\) to be closer to the origin.

Why Regularize? Removing Irrelevant Predictors

  • If we had many noise features (unrelated to the response), least squares would still try to find coefficients for each of them. This causes overfitting: our predictions would depend on variables that don’t actually matter.

  • We don’t expect all genes to be relevant to cell viability in this experiment. It’s more likely that a few key pathways are driving resistance.

  • Feature selection: If the noise variables do not sufficiently reduce the sum of squared errors (SSE), then the Lasso sets their coefficients \(\beta_{j}\) to exactly zero. The \(\ell^{1}\) penalty “induces sparsity.”

\(\ell^{1}\) Regularization

The Lasso objective is \[\begin{align*} \min_{\beta \in \mathbf{R}^J} \left[ \frac{1}{2N} \sum_{i=1}^N \left(y_i - \mathbf{x}_i^\top \beta\right)^2 + \lambda \lVert \beta \rVert_1 \right]. \end{align*}\] This is like ridge regression, but with a new \(\ell^{1}\) penalty:

\[\begin{align*} \|\beta\|_{1} := \sum_{j = 1}^{J} \left|\beta_{j}\right| \end{align*}\]

\(\ell^{1}\) Regularization

It’s not obvious, but the minimizers often set coordinates \(\beta_{j}\) to exactly zero. The “selected” features are those where \(\beta_{j} \neq 0\).

\[\begin{align*} \min_{\beta \in \mathbf{R}^J} \left[ \frac{1}{2N} \sum_{i=1}^N \left(y_i - \mathbf{x}_i^\top \beta\right)^2 + \lambda \lVert \beta \rVert_1 \right]. \end{align*}\]
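The contrast with ridge is easy to see on simulated data (scikit-learn's `Lasso` uses the same \(1/2N\) scaling as the objective above, with `alpha` playing the role of \(\lambda\)): ridge shrinks every coefficient but leaves all of them nonzero, while the lasso sets most of them exactly to zero.

```python
# On the same data, ridge keeps all 20 coefficients nonzero while
# the lasso zeroes out most of the irrelevant ones.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(6)
X = rng.normal(size=(80, 20))
y = X[:, 0] * 2.0 + rng.normal(scale=0.5, size=80)   # only feature 0 matters

ridge_coef = Ridge(alpha=1.0).fit(X, y).coef_
lasso_coef = Lasso(alpha=0.2).fit(X, y).coef_
print(np.sum(ridge_coef == 0), np.sum(lasso_coef == 0))
```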

Loss Surface

The minimizers are encouraged to lie in the “creases” where some \(\beta_{j}\) are exactly zero.

Exercise

Respond to the following T/F questions from the reading on linear model extensions. Justify your choices.

  1. The magnitude of the LS coefficient of a predictive feature corresponds to how important the feature is for generating the prediction.

  2. Increasing the number of predictive features in a predictive fit will always improve the predictive performance.

  3. More regularization means that regularized coefficients will be closer to the original un-regularized LS coefficients.

Dietrich, Sascha, Małgorzata Oleś, Junyan Lu, Leopold Sellner, Simon Anders, Britta Velten, Bian Wu, et al. 2017. “Drug-Perturbation-Based Stratification of Blood Cancer.” Journal of Clinical Investigation 128 (1): 427–45. https://doi.org/10.1172/jci93801.